Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.
library(fivethirtyeight)
data(drinks)What are the variable types? Any missing values we should worry about?
glimpse(drinks)## Rows: 193
## Columns: 5
## $ country <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …
#The variable types can be seen as below. For country it is character, for beer_serving, spirit_servings and wine_servings they are integer, for total_litres_of_pure_alcohol it is double.skim(drinks)| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
# No missing values as seen in the result below in column "n_missing".Make a plot that shows the top 25 beer consuming countries
#
top_25_beer <- drinks %>%
top_n(25,beer_servings)
ggplot(top_25_beer,aes(y=fct_reorder(country,beer_servings),x=beer_servings))+
geom_col(fill="green")+
labs(title = "Top 25 beer consuming countries",x = "Beer servings", y = "Country")+
NULLMake a plot that shows the top 25 wine consuming countries
top_25_wine <- drinks %>%
top_n(25,wine_servings)
ggplot(top_25_wine,aes(y=fct_reorder(country,wine_servings),x=wine_servings))+
geom_col(fill="red")+
labs(title = "Top 25 wine consuming countries",x = "Wine servings", y = "Country")+
NULLFinally, make a plot that shows the top 25 spirit consuming countries
top_25_spirit <- drinks %>%
top_n(25,spirit_servings)
ggplot(top_25_spirit,aes(y=fct_reorder(country,spirit_servings),x=spirit_servings))+
geom_col(fill="blue")+
labs(title = "Top 25 spirit consuming countries",x = "Spirit servings", y = "Country")+
NULLWhat can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).
Namibia has the highest beer consumption, which is unexpected. This maybe because it’s a former German colony, in which beer culture could have been established historically, leading to a higher beer consumption. France has the highest wine consumption, which is expected. It is a culture in France to associate every meal with a wine-pairing, and France is the second largest wine producing country in the world. Portugal has the second highest wine consumption, which is also expected, given its tradition for producing and consuming fortified wines, such as Port and Madeira. Grenada has the highest spirit consumption. This is probably due to the fact that spirits make up 66% of all alcohol consumption in this country.
We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
skim(movies)| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
# There is no missing values as seen in the result below. Not all the entries are distinct as the n_unique values do not match, indicating that several values were repeated. Based on the number of rows 2961 and n_unique 2907, we can conclude that 54 rows are duplicated.Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
gross : The gross earnings in the US box office, not adjusted for inflationbudget: The movie’s budgetcast_facebook_likes: the number of facebook likes cast memebrs receivedvotes: the number of people who voted for (or rated) the movie in IMDBreviews: the number of reviews for that movierating: IMDB average ratingunique_movies <- movies %>%
distinct(title,genre,director,year,.keep_all=TRUE)new_unique_movies <- movies %>%
summarise("number of unique movies"=n()) count_movie_genre <- unique_movies %>%
group_by(genre) %>%
count(sort=TRUE)
colnames(count_movie_genre) <- c("Genre", "Number of Movies")
count_movie_genre %>%
kable()| Genre | Number of Movies |
|---|---|
| Comedy | 844 |
| Action | 719 |
| Drama | 484 |
| Adventure | 281 |
| Crime | 198 |
| Biography | 135 |
| Horror | 128 |
| Animation | 35 |
| Fantasy | 26 |
| Documentary | 25 |
| Mystery | 15 |
| Sci-Fi | 7 |
| Family | 3 |
| Musical | 2 |
| Romance | 2 |
| Western | 2 |
| Thriller | 1 |
gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending orderreturn_movie_genre <- unique_movies %>%
group_by(genre) %>%
summarise(mean_gross = mean(gross),mean_budget = mean(budget)) %>%
mutate(return_on_budget =
((mean_gross - mean_budget)/mean_budget)) %>%
arrange(-return_on_budget)
colnames(return_movie_genre) <- c("Genre", "Mean gross [$]", "Mean budget [$]", "Return on budget")
return_movie_genre %>%
kable()| Genre | Mean gross [$] | Mean budget [$] | Return on budget |
|---|---|---|---|
| Musical | 9.21e+07 | 3189500 | 27.871 |
| Family | 1.49e+08 | 14833333 | 9.056 |
| Western | 2.08e+07 | 3465000 | 5.009 |
| Documentary | 1.74e+07 | 5887852 | 1.947 |
| Horror | 3.78e+07 | 13804379 | 1.737 |
| Fantasy | 4.19e+07 | 18484615 | 1.267 |
| Comedy | 4.25e+07 | 24458506 | 0.737 |
| Mystery | 6.91e+07 | 41500000 | 0.665 |
| Animation | 9.84e+07 | 61701429 | 0.595 |
| Biography | 4.52e+07 | 28543696 | 0.584 |
| Adventure | 9.44e+07 | 64692313 | 0.458 |
| Drama | 3.68e+07 | 25832605 | 0.423 |
| Crime | 3.76e+07 | 26527405 | 0.417 |
| Romance | 3.13e+07 | 25107500 | 0.245 |
| Action | 8.63e+07 | 70774558 | 0.219 |
| Sci-Fi | 2.98e+07 | 27607143 | 0.079 |
| Thriller | 2.47e+03 | 300000 | -0.992 |
top_15_directors <- unique_movies %>%
select(director, gross) %>%
group_by(director) %>%
summarise(total_gross = sum(gross),
mean_gross = mean(gross),
median_gross = median(gross),
sd_gross = sd(gross)) %>%
top_n(15,total_gross) %>%
arrange(-total_gross)
colnames(top_15_directors) <- c("Director", "Total gross [$]", "Mean gross [$]", "Median gross [$]", "Standard deviation of gross [$]")
top_15_directors %>%
kable()| Director | Total gross [$] | Mean gross [$] | Median gross [$] | Standard deviation of gross [$] |
|---|---|---|---|---|
| Steven Spielberg | 4.01e+09 | 1.75e+08 | 1.64e+08 | 1.01e+08 |
| Michael Bay | 2.20e+09 | 1.83e+08 | 1.68e+08 | 1.26e+08 |
| James Cameron | 1.91e+09 | 3.18e+08 | 1.76e+08 | 3.09e+08 |
| Christopher Nolan | 1.81e+09 | 2.27e+08 | 1.97e+08 | 1.87e+08 |
| George Lucas | 1.74e+09 | 3.48e+08 | 3.80e+08 | 1.46e+08 |
| Robert Zemeckis | 1.62e+09 | 1.25e+08 | 1.01e+08 | 9.13e+07 |
| Tim Burton | 1.56e+09 | 1.11e+08 | 6.98e+07 | 9.93e+07 |
| Sam Raimi | 1.44e+09 | 1.80e+08 | 1.38e+08 | 1.75e+08 |
| Clint Eastwood | 1.38e+09 | 7.25e+07 | 4.67e+07 | 7.55e+07 |
| Francis Lawrence | 1.36e+09 | 2.72e+08 | 2.82e+08 | 1.35e+08 |
| Ron Howard | 1.34e+09 | 1.11e+08 | 1.02e+08 | 8.19e+07 |
| Gore Verbinski | 1.33e+09 | 1.90e+08 | 1.23e+08 | 1.54e+08 |
| Andrew Adamson | 1.14e+09 | 2.84e+08 | 2.80e+08 | 1.21e+08 |
| Shawn Levy | 1.13e+09 | 1.03e+08 | 8.55e+07 | 6.55e+07 |
| Ridley Scott | 1.13e+09 | 8.06e+07 | 4.78e+07 | 6.88e+07 |
dist_by_genre <- unique_movies %>%
select(genre, rating) %>%
group_by(genre) %>%
summarise(mean_rating = mean(rating),
min_rating = min(rating),
max_rating = max(rating),
sd_rating = sd(rating))
colnames(dist_by_genre) <- c("Genre", "Mean rating", "Min rating", "Max rating", "Standard deviation of rating")
dist_by_genre %>%
kable()| Genre | Mean rating | Min rating | Max rating | Standard deviation of rating |
|---|---|---|---|---|
| Action | 6.23 | 2.1 | 9.0 | 1.039 |
| Adventure | 6.51 | 2.3 | 8.6 | 1.106 |
| Animation | 6.65 | 4.5 | 8.0 | 0.968 |
| Biography | 7.11 | 4.5 | 8.9 | 0.760 |
| Comedy | 6.11 | 1.9 | 8.8 | 1.024 |
| Crime | 6.92 | 4.8 | 9.3 | 0.853 |
| Documentary | 6.66 | 1.6 | 8.5 | 1.767 |
| Drama | 6.74 | 2.1 | 8.8 | 0.915 |
| Family | 6.50 | 5.7 | 7.9 | 1.217 |
| Fantasy | 6.08 | 4.3 | 7.9 | 0.953 |
| Horror | 5.79 | 3.6 | 8.5 | 0.987 |
| Musical | 6.75 | 6.3 | 7.2 | 0.636 |
| Mystery | 6.84 | 4.6 | 8.5 | 0.910 |
| Romance | 6.65 | 6.2 | 7.1 | 0.636 |
| Sci-Fi | 6.66 | 5.0 | 8.2 | 1.094 |
| Thriller | 4.80 | 4.8 | 4.8 | NA |
| Western | 5.70 | 4.1 | 7.3 | 2.263 |
unique_movies %>%
ggplot(aes(x=rating)) +
geom_histogram() +
facet_wrap(vars(genre)) +
labs(title = "Rating statistics of each genre", x = "Rating", y = "Count") NULL## NULL
ggplot to answer the followinggross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?unique_movies %>%
ggplot(aes(y=gross, x=cast_facebook_likes)) +
geom_point() +
labs(title = "Gross vs. Cast facebook likes", x = "Cast Facebook Likes", y = "Gross [$]")+
NULL#It's unlikely that number of facebook likes the cast received will be a good predictor of how much money a movie will make. X-axis is the independent variable "cast_facebook_likes", Y-axis is the dependent variable "gross". There is no obvious correlation between the two variables.gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.unique_movies %>%
ggplot(aes(y=gross, x=budget)) +
geom_point() +
labs(title = "Gross vs. Budget", x = "Budget [$]", y = "Gross [$]")+
NULL#It can be inferred from the plot that budget is a weak predictor of how much money a movie will make. There is a lot of variation and scatter in the data thus the correlation is very weak.gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?unique_movies %>%
ggplot(aes(y=gross, x=rating)) +
geom_point() +
facet_wrap(vars(genre))+
labs(title = "Gross vs. the rating of the movie genre", x = "Rating", y = "Gross [$]")+
NULL#In the genre where there are enough data instances, the rating can be a good predictor for how much money a movie will make at the box office. This is expected because movies that are more popular are more likely to make more money. In the Action and Adventure genres, where there are lots of data points, the trend seems to be exponential. In the genres with few data points, it's hard to determine what the relationship is between the two variables.You may find useful the material on finance data sources.
nyse <- read_csv(here::here("data","nyse.csv"))Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order
companies_per_sector <- nyse %>%
group_by(sector) %>%
count(sort=TRUE)
colnames(companies_per_sector) <- c("Sector", "Number of Companies")
companies_per_sector %>%
kable()| Sector | Number of Companies |
|---|---|
| Finance | 97 |
| Consumer Services | 79 |
| Public Utilities | 60 |
| Capital Goods | 45 |
| Health Care | 45 |
| Energy | 42 |
| Technology | 40 |
| Basic Industries | 39 |
| Consumer Non-Durables | 31 |
| Miscellaneous | 12 |
| Transportation | 10 |
| Consumer Durables | 8 |
colnames(companies_per_sector) <- c("sector", "number_of_companies")
companies_per_sector %>%
ggplot(aes(reorder(sector,-number_of_companies), number_of_companies)) +
geom_bar(stat="identity")+
labs(title = "Number of companies in each sector", x = "Sector", y = "Number of Companies")+
NULLNext, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).
# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- c("AAPL","DIS","DPZ","ANF","TSLA","XOM","SPY" ) %>%
tq_get(get = "stock.prices",
from = "2011-01-01",
to = "2021-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame## Rows: 18,781
## Columns: 8
## Groups: symbol [7]
## $ symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, 1…
## $ high <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, 1…
## $ low <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, 1…
## $ close <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, 1…
## $ volume <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, 4…
## $ adjusted <dbl> 10.1, 10.2, 10.2, 10.2, 10.3, 10.5, 10.5, 10.6, 10.6, 10.7, 1…
Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns",
cols = c(nested.col))
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns",
cols = c(nested.col))
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns",
cols = c(nested.col))Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.
summarise_monthly_returns <- myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(min_monthly_return = min(monthly_returns),
max_monthly_return = max(monthly_returns),
median_monthly_return = median(monthly_returns),
mean_monthly_return = mean(monthly_returns),
sd_monthly_return = sd(monthly_returns))
colnames(summarise_monthly_returns) <- c("Stock","Minimum Monthly Return [$]", "Maximum Monthly Return [$]", "Median of Monthly Return [$]", "Mean of Monthly Return [$]", "Standard Deviation of Monthly Return [$]")
summarise_monthly_returns %>%
kable()| Stock | Minimum Monthly Return [$] | Maximum Monthly Return [$] | Median of Monthly Return [$] | Mean of Monthly Return [$] | Standard Deviation of Monthly Return [$] |
|---|---|---|---|---|---|
| AAPL | -0.181 | 0.217 | 0.026 | 0.024 | 0.078 |
| ANF | -0.421 | 0.507 | 0.003 | 0.009 | 0.145 |
| DIS | -0.179 | 0.234 | 0.009 | 0.016 | 0.068 |
| DPZ | -0.188 | 0.342 | 0.025 | 0.031 | 0.075 |
| SPY | -0.125 | 0.127 | 0.017 | 0.012 | 0.038 |
| TSLA | -0.224 | 0.811 | 0.015 | 0.052 | 0.176 |
| XOM | -0.262 | 0.233 | 0.002 | 0.003 | 0.066 |
Plot a density plot, using geom_density(), for each of the stocks
myStocks_returns_monthly %>%
ggplot(aes(x=monthly_returns)) +
geom_density() +
facet_wrap(vars(symbol))+
labs(title = "Monthly return of each of the stocks", x = "Monthly return [$]", y = "Count")+
NULLWhat can you infer from this plot? Which stock is the riskiest? The least risky?
The risk of the stock can be determined by visually analyzing the proportion of the graph area that lies in the negative direction of the x-axis. This determines how the stocks were performing and whether they were likely to produce negative returns. Based on this, the riskiest stock would be XOM, because its past performance has shown that it often gives negative returns. The least risky stock would be DPZ, because its past performance indicates that it is likely to produce positive returns.
Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock
library(ggrepel)
myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(mean_monthly_return = mean(monthly_returns),
sd_monthly_return = sd(monthly_returns)) %>%
ggplot(aes(y=sd_monthly_return, x=mean_monthly_return, label = symbol)) +
geom_point() +
geom_text_repel()+
labs(title = "The risk of each of the stocks expressed as standard deviation vs. the monthly return", x = "Monthly return [$]", y = "Standard deviation [$]")+
NULLWhat can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
Based on this graph, ANF had a broad distribution of monthly returns but not a higher expected return when compared to other stocks. Therefore it is of high risk but not of high expected return.
For this task, you will analyse a data set on Human Resoruce Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.
First let us load the data
hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description
hr_cleaned <- hr_dataset %>%
clean_names() %>%
mutate(
education = case_when(
education == 1 ~ "Below College",
education == 2 ~ "College",
education == 3 ~ "Bachelor",
education == 4 ~ "Master",
education == 5 ~ "Doctor"
),
environment_satisfaction = case_when(
environment_satisfaction == 1 ~ "Low",
environment_satisfaction == 2 ~ "Medium",
environment_satisfaction == 3 ~ "High",
environment_satisfaction == 4 ~ "Very High"
),
job_satisfaction = case_when(
job_satisfaction == 1 ~ "Low",
job_satisfaction == 2 ~ "Medium",
job_satisfaction == 3 ~ "High",
job_satisfaction == 4 ~ "Very High"
),
performance_rating = case_when(
performance_rating == 1 ~ "Low",
performance_rating == 2 ~ "Good",
performance_rating == 3 ~ "Excellent",
performance_rating == 4 ~ "Outstanding"
),
work_life_balance = case_when(
work_life_balance == 1 ~ "Bad",
work_life_balance == 2 ~ "Good",
work_life_balance == 3 ~ "Better",
work_life_balance == 4 ~ "Best"
)
) %>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:
attrition)leave_count <- hr_cleaned %>%
select(attrition) %>%
group_by(attrition) %>%
summarise(attrition_count = count(attrition))
leave_count %>%
ggplot(aes(x=attrition,y=attrition_count)) +
geom_col() +
labs(title = "Number of people who leave the company expressed as attrition", x = "Attrition", y = "Count")+
NULLleave_count[2,2]/sum(leave_count$attrition_count)## attrition_count
## 1 0.161
# 16.1% of the people in this survey leave the company.age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?employee_info <- hr_cleaned %>%
select(age, years_at_company, monthly_income, years_since_last_promotion)
age_plot <- ggplot(employee_info, aes(x=age)) +
geom_density() +
labs(title = "Density plot of employees' age", x = "Age", y = "Density")+
NULL
years_plot <- ggplot(employee_info, aes(x=years_at_company)) +
geom_density() +
labs(title = "Density plot of how long employees have stayed in the company", x = "Years at company", y = "Density")+
NULL
income_plot <- ggplot(employee_info, aes(x=monthly_income)) +
geom_density() +
labs(title = "Density plot of employees' monthly income", x = "Monthly income", y = "Density")+
NULL
promotion_plot <- ggplot(employee_info, aes(x=years_since_last_promotion)) +
geom_density() +
labs(title = "Density plot of how long it has been since employees' are last promoted", x = "Years since last promotion", y = "Density")+
NULL
ggarrange(age_plot,years_plot,income_plot,promotion_plot, ncol=2, nrow=2)#By looking at the above plots, the "age_plot" is closer to normal distribution.job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of totaljob_satis_dist <- hr_cleaned %>%
select(job_satisfaction) %>%
group_by(job_satisfaction) %>%
summarise(cnt = n()) %>%
mutate(freq = percent(cnt / sum(cnt)))
job_satis_dist %>%
ggplot(aes(reorder(job_satisfaction,-cnt), cnt)) +
geom_bar(stat="identity")+
geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
labs(title = "Distribution of job satisfaction", x="Job satisfaction", y="Count")+
NULL work_life_dist <- hr_cleaned %>%
select(work_life_balance) %>%
group_by(work_life_balance) %>%
summarise(cnt = n()) %>%
mutate(freq = percent(cnt / sum(cnt)))
work_life_dist %>%
ggplot(aes(reorder(work_life_balance,-cnt), cnt)) +
geom_bar(stat="identity")+
geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
labs(title = "Distribution of work life balance", x="Work life balance", y="Count")+
NULL hr_cleaned %>%
select(education,monthly_income) %>%
ggplot(aes(x=education, y=monthly_income)) +
geom_boxplot()+
labs(title = "Relationship between monthly income and education", x = "Education", y = "Monthly income") +
NULLincome_gender <- hr_cleaned %>%
group_by(gender) %>%
ggplot(aes(x=gender, y=monthly_income)) +
geom_boxplot()+
labs(title = "Relationship between monthly income and gender", x = "Gender", y = "Monthly income") +
NULL
income_genderhr_cleaned %>%
mutate(job_role = fct_rev(fct_reorder(job_role, monthly_income, median))) %>%
ggplot(aes(x=job_role, y=monthly_income)) +
geom_boxplot()+
labs(title = "Relationship between monthly income and job role", x = "Job role", y = "Mean monthly income") +
NULLincome_edu <- hr_cleaned %>%
group_by(education) %>%
summarise(mean_income = mean(monthly_income)) %>%
ggplot(aes(x=education, y=mean_income)) +
geom_col()+
labs(title = "Relationship between monthly income and education", x = "Education", y = "Mean monthly income") +
NULLggthemeshr_cleaned %>%
ggplot(aes(x=monthly_income)) +
ggthemes::theme_wsj() +
geom_density() +
facet_wrap(vars(education)) +
labs(title = "Distribution of monthly
income by education level", x = "Education", y = "Density") +
NULLjob_rolehr_cleaned %>%
ggplot(aes(y=monthly_income, x=age)) +
geom_point() +
facet_wrap(vars(job_role))+
labs(title = "Relationship between income and age", x = "Age", y = "Monthly income") +
NULLThe purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.
First we look at the data if there’s missing values.
skim(election2020_results)| Name | election2020_results |
| Number of rows | 22093 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 4 | 20 | 0 | 51 | 0 |
| state_po | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| county_name | 0 | 1 | 3 | 21 | 0 | 1865 | 0 |
| fips | 9 | 1 | 5 | 5 | 0 | 3153 | 0 |
| office | 0 | 1 | 9 | 12 | 0 | 2 | 0 |
| candidate | 0 | 1 | 5 | 17 | 0 | 4 | 0 |
| party | 0 | 1 | 5 | 11 | 0 | 5 | 0 |
| mode | 0 | 1 | 4 | 20 | 0 | 16 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2020 | 0 | 2020 | 2020 | 2020 | 2020 | 2020 | ▁▁▇▁▁ |
| candidatevotes | 1 | 1 | 7173 | 42426 | 0 | 15 | 177 | 2341 | 3028885 | ▇▁▁▁▁ |
| totalvotes | 5 | 1 | 50118 | 140287 | 66 | 6190 | 13669 | 36638 | 4264365 | ▇▁▁▁▁ |
| version | 0 | 1 | 20210622 | 0 | 20210622 | 20210622 | 20210622 | 20210622 | 20210622 | ▁▁▇▁▁ |
#We can see that there are 9 missing values. We need to clean up the missing values.
#remove missing values
election2020_results <- na.omit(election2020_results)Then, calculate the rate of votes for Trump in each county.
vote_data <- election2020_results %>%
select(county_name, fips, candidate, candidatevotes, totalvotes) %>%
filter(candidate == "DONALD J TRUMP") %>%
group_by(county_name, fips, totalvotes) %>%
summarise(vote_sum = sum(candidatevotes)) %>%
mutate(vote_rate = vote_sum/totalvotes)Then, filter the most recent data from vaccinations, in this case, september data.
sept_vaccinations <- vaccinations %>%
select(date, fips,series_complete_pop_pct) %>%
filter(date == "09/04/2021")
#After we filtered the dataset, we can conduct cleaning so that the code is more efficient regarding to a smaller dataset.
sept_vaccinations <- na.omit(sept_vaccinations) %>%
unique()Next, join the september vaccination percentage to the percentage of votes for each county by each candidate.
pct_trump_votes_vacc <- inner_join (vote_data,
sept_vaccinations ,
by="fips")Next, join the population size for each county into the percentage trump votes vaccination table.
votes_vacc_pop <- inner_join (pct_trump_votes_vacc,
population ,
by="fips")Then, plot the relationship between percentage of trump vote and percentage of population vaccinated.
vac_pop_plot <- votes_vacc_pop %>%
ggplot(aes(x=vote_rate,y=series_complete_pop_pct/100, size=pop_estimate_2019)) +
#fit a regression line to the plot
geom_smooth(method=lm) +
#aethetics for population size
geom_point(color = "black", alpha = 0.2) +
geom_point(size = 0.3)+
scale_size(range = c(0, 30)) +
#draw the dash lines
geom_hline(yintercept=0.85, linetype="dashed") +
geom_hline(yintercept=0.539, linetype="dashed") +
geom_hline(yintercept=0.508, linetype="dashed") +
#break the graph into 3x3 grids with solid lines
geom_hline(yintercept=0.4, linetype="solid") +
geom_hline(yintercept=0.6, linetype="solid") +
geom_vline(xintercept=0.4, linetype="solid") +
geom_vline(xintercept=0.6, linetype="solid") +
labs(title = "COVID 19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
x = "2020 Trump Vote %",
y = "% of Total Population Vaccinated") +
NULL
#add the text messages
vac_pop_plot + annotate("text", x = 0.2,y = 0.89, label = "Herd Immunity Threshold (?)",color = 'blue',size = 3) +
annotate("text", x = 0.05, y = 0.56, label = "TARGET:53.9%", fontface = 'italic', color = 'blue',size = 3) +
annotate("text", x = 0.05, y = 0.525, label = "ACTUAL:50.8%", fontface = 'italic', color = 'blue',size = 3)+
annotate("text", x = 0.3, y = 0.1, label = "09/04/2021", fontface = 'italic', color = 'red') +
annotate("text", x = 0.45, y = 1, label = "EVERY U.S. COUNTY", size = 6) +
annotate("text", x = 0.1, y = 0.06, label = deparse(bquote(~R^2 ==~ 0.4216)),
parse = T, size = 3, color = 'red')+
annotate("text", x = 0.1, y = 0.1, label = "y = -0.4224x + 0.6451", color = 'red',size = 3) +
scale_y_continuous(labels = scales::percent)+
scale_x_continuous(labels = scales::percent, limits = c(0,1))+
theme(legend.position="none")The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.
The following code will scrape the wikipedia page and import the table in a dataframe.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called polls
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
slice(2:(n()-1)) %>% # drop the first row, as it contains again the variable names and last row that contains 2017 results
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date object using lubridate::dmy()
end_date = dmy(end_date),
# we also get the month and week number from the date, if we want to do analysis by month- week, etc.
month = month(end_date),
week = isoweek(end_date)
)glimpse(german_election_polls)## Rows: 214
## Columns: 16
## $ polling_firm <chr> "INSA", "Forschungsgruppe Wahlen", "Infratest dimap", "…
## $ fieldwork_date <chr> "30 Aug – 3 Sep 2021", "31 Aug – 2 Sep 2021", "30 Aug –…
## $ samplesize <chr> "1,427", "1,301", "1,337", "2,340", "1,729", "1,439", "…
## $ abs <chr> "–", "29", "–", "–", "–", "–", "–", "24", "–", "–", "27…
## $ union <dbl> 20.0, 22.0, 20.0, 19.5, 20.0, 21.0, 20.0, 21.0, 21.0, 2…
## $ spd <dbl> 25.0, 25.0, 25.0, 27.0, 25.0, 25.0, 25.0, 23.0, 25.0, 2…
## $ af_d <dbl> 12.0, 11.0, 12.0, 10.5, 12.0, 11.0, 11.0, 11.0, 11.0, 1…
## $ fdp <dbl> 13.0, 11.0, 13.0, 13.0, 13.0, 11.0, 13.5, 12.0, 11.0, 1…
## $ linke <dbl> 7.0, 7.0, 6.0, 6.5, 8.0, 7.0, 7.0, 6.0, 7.0, 6.0, 6.0, …
## $ grune <dbl> 16.0, 17.0, 16.0, 15.5, 15.0, 19.0, 16.5, 18.0, 19.0, 1…
## $ fw <chr> "–", "–", "–", "–", "–", "–", "–", "–", "–", "–", "3", …
## $ others <chr> "7", "7", "8", "8", "8", "6", "7", "9", "6", "8", "6", …
## $ lead <chr> "5", "3", "5", "7.5", "5", "4", "5", "2", "4", "3", "Ti…
## $ end_date <date> 2021-09-03, 2021-09-02, 2021-09-01, 2021-08-31, 2021-0…
## $ month <dbl> 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
## $ week <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 34, 34, 34, 34, 34, 34,…
#selecting the columns that are useful
selected_data <- german_election_polls %>%
select (samplesize, union, spd, af_d, fdp, linke, grune, end_date)
#change "samplesize" to double
selected_data$samplesize <- as.numeric(gsub(",","",selected_data$samplesize))
election_data_raw <- selected_data %>%
mutate (number_of_vote_union = samplesize*union/100,
number_of_vote_spd = samplesize*spd/100,
number_of_vote_af_d = samplesize*af_d/100,
number_of_vote_fdp = samplesize*fdp/100,
number_of_vote_linke = samplesize*linke/100,
number_of_vote_grune = samplesize*grune/100) %>%
#calculate the number of votes for each party (sample size*percenrage of vote)
group_by(end_date) %>%
mutate (totalsample = sum(samplesize)) %>%
#the total sample size of the result given by each company on the same date
mutate (sum_vote_union = sum(number_of_vote_union),
sum_vote_spd = sum(number_of_vote_spd),
sum_vote_af_d = sum(number_of_vote_af_d),
sum_vote_fdp = sum(number_of_vote_fdp),
sum_vote_linke = sum(number_of_vote_linke),
sum_vote_grune = sum(number_of_vote_grune)) %>%
#calculate the average percentage of vote by using the total number of votes for the party / the total sample size
mutate (average_rate_union = sum_vote_union/totalsample,
average_rate_spd = sum_vote_spd/totalsample,
average_rate_af_d = sum_vote_af_d/totalsample,
average_rate_fdp = sum_vote_fdp/totalsample,
average_rate_linke = sum_vote_linke/totalsample,
average_rate_grune = sum_vote_grune/totalsample)
#remove the duplicated rows for the same date
election_data <- election_data_raw[!duplicated(election_data_raw$totalsample), ]election_data = na.omit(election_data)
roll_mean_union = zoo::rollmean(election_data$average_rate_union,k = 14, fill = NA, align = "left")
roll_mean_spd = zoo::rollmean(election_data$average_rate_spd,k = 14, fill = NA, align = "left")
roll_mean_af_d = zoo::rollmean(election_data$average_rate_af_d,k = 14, fill = NA, align = "left")
roll_mean_fdp = zoo::rollmean(election_data$average_rate_fdp,k = 14, fill = NA, align = "left")
roll_mean_linke = zoo::rollmean(election_data$average_rate_linke,k = 14, fill = NA, align = "left")
roll_mean_grune= zoo::rollmean(election_data$average_rate_grune,k = 14, fill = NA, align = "left")
election_data = cbind(election_data,
roll_mean_union=roll_mean_union,
roll_mean_spd=roll_mean_spd,
roll_mean_af_d=roll_mean_af_d,
roll_mean_fdp=roll_mean_fdp,
roll_mean_linke=roll_mean_linke,
roll_mean_grune=roll_mean_grune)colors = c("CDU/CSU" = "black", "SPD" = "red3", "AfD" = "midnightblue", "FDP" = "orange1",
"Linke" = "mediumvioletred", "Grune" = "green")
election_data %>%
ggplot(aes(x = end_date))+
geom_point(aes(y = average_rate_union), alpha=0.3, size=1) +
geom_point(aes(y = average_rate_spd), color="red3", alpha=0.3, size=1) +
geom_point(aes(y = average_rate_af_d), color="midnightblue", alpha=0.3, size=1)+
geom_point(aes(y = average_rate_fdp), color="orange1", alpha=0.3, size=1)+
geom_point(aes(y = average_rate_linke), color="mediumvioletred", alpha=0.3, size=1)+
geom_point(aes(y = average_rate_grune), color="green", alpha=0.3, size=1)+
geom_line(aes(y = roll_mean_union, color="CDU/CSU"), size=1) +
geom_line(aes(y = roll_mean_spd, color="SPD"),size=1) +
geom_line(aes(y = roll_mean_af_d, color="AfD"),size=1)+
geom_line(aes(y = roll_mean_fdp, color="FDP"),size=1)+
geom_line(aes(y = roll_mean_linke, color="Linke"),size=1)+
geom_line(aes(y = roll_mean_grune, color="Grune"),size=1)+
labs(title = "German Election Poll", x="Date", y="Percentage of vote")+
scale_y_continuous(labels = scales::percent)+
scale_x_date(labels = date_format("%B-%Y"))+
theme(legend.position = "right",legend.background = element_rect(fill="snow",
size=0.5, linetype="solid")) +
theme_bw()+
scale_color_manual(values = colors, name = "Parties")Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else? Yeah